{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9154e182",
    "papermill": {
     "duration": 0.028107,
     "end_time": "2023-09-06T15:22:01.811311",
     "exception": false,
     "start_time": "2023-09-06T15:22:01.783204",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Titanic: training a Gradient Boosting Classifier\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0961d9e4",
    "papermill": {
     "duration": 0.024707,
     "end_time": "2023-09-06T15:22:01.861828",
     "exception": false,
     "start_time": "2023-09-06T15:22:01.837121",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "https://www.kaggle.com/competitions/titanic\n",
    "\n",
    "\n",
    "The goal of this Notebook is to train a **Gradient Boosting** model to give a solution to the famous **Titanic Competition** from the Kaggle platform. In this particular project, we will use this model to make predictions in our *streamlit app* (refer to ```titanic_streamlit/main_app/streamlit_app.py```\n",
    "\n",
    "https://www.kaggle.com/code/marcpaulo/titanic-playground-for-new-kagglers-0-78\n",
    "\n",
    "In another notebook named ```titanic_streamlit/notebooks/training_playground.ipynb``` you can find an exhaustive exploration of the *Titanic Classification problem*. It includes an in-depth analysis and testing: *Exploratory Data Analysis*, *Data Preprocessing* with *Sklearn Pipelines*, and *Hyperparameter Optimization* and *Model selection*.\n",
    "\n",
    "The notebook presented here is a short version of the aforementioned ```training_playground```. Here, we directly train the *Gradient Boosting Classifier* (the best model found) and save it (*pickle*) for future use in ```titanic_streamlit/main_app/streamlit_app.py```\n",
    "\n",
    "**P.S** the model trained here achieves a **0.78468 score in the Kaggle competition**, occupying position *1777/14731*, **TOP 13%**. (last update: 07/09/2023)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "XEJJkwGbj0jl"
   },
   "outputs": [],
   "source": [
    "# Enter your Project Path in which the 'titanic_streamlit' folder is located:\n",
    "\n",
    "notebook_config = {\n",
    "    'your_project_path': '<YOUR_PATH_HERE>/titanic_streamlit',  # TODO: fill this!!!\n",
    "    \n",
    "    'random_state': 12345,  # for the GradientBoostingClassifier\n",
    "    'n_jobs': 1 ,           # for the cross_val_score\n",
    "    'cv': 10,               # for the cross_val_score\n",
    "    \n",
    "    'save_model': True,\n",
    "    'model_file_name': 'trained_grad_boost.pkl',  # where the model is saved\n",
    "    'run_sanity_check': True  # try to load the model afer it's saved\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
    "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
    "execution": {
     "iopub.execute_input": "2023-09-06T15:22:01.967319Z",
     "iopub.status.busy": "2023-09-06T15:22:01.966908Z",
     "iopub.status.idle": "2023-09-06T15:22:03.612200Z",
     "shell.execute_reply": "2023-09-06T15:22:03.610855Z"
    },
    "id": "57d326ac",
    "papermill": {
     "duration": 1.678182,
     "end_time": "2023-09-06T15:22:03.615040",
     "exception": false,
     "start_time": "2023-09-06T15:22:01.936858",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import pickle\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-06T15:22:03.718396Z",
     "iopub.status.busy": "2023-09-06T15:22:03.718010Z",
     "iopub.status.idle": "2023-09-06T15:22:03.757632Z",
     "shell.execute_reply": "2023-09-06T15:22:03.756241Z"
    },
    "id": "84939747",
    "papermill": {
     "duration": 0.069358,
     "end_time": "2023-09-06T15:22:03.760594",
     "exception": false,
     "start_time": "2023-09-06T15:22:03.691236",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "data_path = notebook_config['your_project_path'] + '/data/train.csv'\n",
    "df_train = pd.read_csv(data_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-06T15:22:03.813339Z",
     "iopub.status.busy": "2023-09-06T15:22:03.812580Z",
     "iopub.status.idle": "2023-09-06T15:22:03.819693Z",
     "shell.execute_reply": "2023-09-06T15:22:03.818775Z"
    },
    "id": "79c1d11a",
    "papermill": {
     "duration": 0.035887,
     "end_time": "2023-09-06T15:22:03.821973",
     "exception": false,
     "start_time": "2023-09-06T15:22:03.786086",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# lowercase all column names\n",
    "df_train.columns = [c.lower() for c in df_train.columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nvTwchwYfInh"
   },
   "source": [
    "# 1. Exploratory Data Analysis & Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 258
    },
    "execution": {
     "iopub.execute_input": "2023-09-06T15:22:03.926067Z",
     "iopub.status.busy": "2023-09-06T15:22:03.924894Z",
     "iopub.status.idle": "2023-09-06T15:22:03.955257Z",
     "shell.execute_reply": "2023-09-06T15:22:03.953909Z"
    },
    "id": "33e18a49",
    "outputId": "1cc28a31-3e50-40c9-d0af-9fffdefa87a1",
    "papermill": {
     "duration": 0.059881,
     "end_time": "2023-09-06T15:22:03.958238",
     "exception": false,
     "start_time": "2023-09-06T15:22:03.898357",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "df_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-09-06T15:22:04.011314Z",
     "iopub.status.busy": "2023-09-06T15:22:04.010681Z",
     "iopub.status.idle": "2023-09-06T15:22:04.016693Z",
     "shell.execute_reply": "2023-09-06T15:22:04.015515Z"
    },
    "id": "b3d74efb",
    "outputId": "924fa368-26ca-4478-b60a-672ba1f768d6",
    "papermill": {
     "duration": 0.035398,
     "end_time": "2023-09-06T15:22:04.019567",
     "exception": false,
     "start_time": "2023-09-06T15:22:03.984169",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(df_train.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-06T15:22:04.423031Z",
     "iopub.status.busy": "2023-09-06T15:22:04.422609Z",
     "iopub.status.idle": "2023-09-06T15:22:04.431388Z",
     "shell.execute_reply": "2023-09-06T15:22:04.430121Z"
    },
    "id": "06fc9a37",
    "papermill": {
     "duration": 0.038451,
     "end_time": "2023-09-06T15:22:04.433874",
     "exception": false,
     "start_time": "2023-09-06T15:22:04.395423",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "df_train = df_train.drop(columns=['passengerid', 'name', 'cabin', 'ticket'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-06T15:22:09.099101Z",
     "iopub.status.busy": "2023-09-06T15:22:09.098690Z",
     "iopub.status.idle": "2023-09-06T15:22:09.104630Z",
     "shell.execute_reply": "2023-09-06T15:22:09.103471Z"
    },
    "id": "71aad67f",
    "papermill": {
     "duration": 0.039006,
     "end_time": "2023-09-06T15:22:09.107052",
     "exception": false,
     "start_time": "2023-09-06T15:22:09.068046",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# there are some outliers in 'fare' columns,\n",
    "# let's cut the maximum value to be 300\n",
    "df_train.loc[df_train['fare'] > 300, 'fare'] = 300"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-06T15:22:10.800363Z",
     "iopub.status.busy": "2023-09-06T15:22:10.799720Z",
     "iopub.status.idle": "2023-09-06T15:22:10.806259Z",
     "shell.execute_reply": "2023-09-06T15:22:10.805341Z"
    },
    "id": "1fce4610",
    "papermill": {
     "duration": 0.041071,
     "end_time": "2023-09-06T15:22:10.808625",
     "exception": false,
     "start_time": "2023-09-06T15:22:10.767554",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "## a new feature: num_relatives = sibsp + parch\n",
    "df_train['num_relatives'] = df_train['sibsp'] + df_train['parch']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-06T15:22:11.958953Z",
     "iopub.status.busy": "2023-09-06T15:22:11.958193Z",
     "iopub.status.idle": "2023-09-06T15:22:12.540436Z",
     "shell.execute_reply": "2023-09-06T15:22:12.539316Z"
    },
    "id": "8b2f77e1",
    "papermill": {
     "duration": 0.617988,
     "end_time": "2023-09-06T15:22:12.543166",
     "exception": false,
     "start_time": "2023-09-06T15:22:11.925178",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.pipeline import Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-09-06T15:22:12.606826Z",
     "iopub.status.busy": "2023-09-06T15:22:12.606413Z",
     "iopub.status.idle": "2023-09-06T15:22:12.616362Z",
     "shell.execute_reply": "2023-09-06T15:22:12.615059Z"
    },
    "id": "03d4e30b",
    "papermill": {
     "duration": 0.044581,
     "end_time": "2023-09-06T15:22:12.618946",
     "exception": false,
     "start_time": "2023-09-06T15:22:12.574365",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# First, create three small pipelines to stack more than one Transformer\n",
    "# for a specific feature. Those features that only require one\n",
    "# transformation are handled by the final 'preprocessor' here below.\n",
    "age_pipe = Pipeline(steps=[\n",
    "    ('age_imp', SimpleImputer(strategy='median')),\n",
    "    ('age_scale', MinMaxScaler())\n",
    "])\n",
    "fare_pipe = Pipeline(steps=[\n",
    "    ('fare_imp', SimpleImputer(strategy='mean')),\n",
    "    ('fare_scale', MinMaxScaler())\n",
    "])\n",
    "embarked_pipe = Pipeline(steps=[\n",
    "    ('embarked_imp', SimpleImputer(strategy='most_frequent')),\n",
    "    ('embarked_onehot', OneHotEncoder(drop=None))\n",
    "])\n",
    "\n",
    "# Let's create the final 'preprocessor'\n",
    "preprocessor = ColumnTransformer(\n",
    "    transformers=[\n",
    "        ('age_pipe', age_pipe, ['age']),\n",
    "        ('fare_pipe', fare_pipe, ['fare']),\n",
    "        ('embarked_pipe', embarked_pipe, ['embarked']),\n",
    "        ('minmax_scaler', MinMaxScaler(), ['sibsp', 'parch', 'num_relatives']),\n",
    "        ('pclass_onehot', OneHotEncoder(drop=None), ['pclass']),\n",
    "        ('sex_onehot', OneHotEncoder(drop='first'), ['sex'])\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ede030aa",
    "papermill": {
     "duration": 0.032134,
     "end_time": "2023-09-06T15:22:12.683070",
     "exception": false,
     "start_time": "2023-09-06T15:22:12.650936",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# 2. Gradient Boosting Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "EChzTJrVf3dJ"
   },
   "outputs": [],
   "source": [
    "y_train = df_train['survived'].values  # [0,1]\n",
    "df_train = df_train.drop(columns=['survived'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "execution": {
     "iopub.execute_input": "2023-09-06T15:23:37.348438Z",
     "iopub.status.busy": "2023-09-06T15:23:37.348024Z",
     "iopub.status.idle": "2023-09-06T15:24:18.815558Z",
     "shell.execute_reply": "2023-09-06T15:24:18.814404Z"
    },
    "id": "6a163024",
    "outputId": "49882331-ce3c-4d27-f17b-ebab1c218ed2",
    "papermill": {
     "duration": 41.54614,
     "end_time": "2023-09-06T15:24:18.855692",
     "exception": false,
     "start_time": "2023-09-06T15:23:37.309552",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "\n",
    "grad_boost = Pipeline(steps=[\n",
    "    ('preprocessor', preprocessor),\n",
    "    ('grad_boost', GradientBoostingClassifier(\n",
    "        n_estimators=100,\n",
    "        learning_rate=0.1,\n",
    "        random_state=notebook_config['random_state']\n",
    "    ))\n",
    "])\n",
    "\n",
    "grad_boost_acc = cross_val_score(\n",
    "    estimator=grad_boost,\n",
    "    X=df_train,\n",
    "    y=y_train,\n",
    "    scoring='accuracy',\n",
    "    cv=notebook_config['cv'],\n",
    "    n_jobs=notebook_config['n_jobs']\n",
    ")\n",
    "\n",
    "print('best GradientBoosting acc (mean) =', round(np.mean(grad_boost_acc), 2))\n",
    "print('best GradientBoosting acc (std)  =', round(np.std(grad_boost_acc), 2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 225
    },
    "id": "1bpvW3PzjHSj",
    "outputId": "5f524964-ba90-4a41-92e6-f3c34671aa24"
   },
   "outputs": [],
   "source": [
    "grad_boost = Pipeline(steps=[\n",
    "    ('preprocessor', preprocessor),\n",
    "    ('grad_boost', GradientBoostingClassifier(\n",
    "        n_estimators=100,\n",
    "        learning_rate=0.1,\n",
    "        random_state=notebook_config['random_state']\n",
    "    ))\n",
    "])\n",
    "\n",
    "grad_boost.fit(df_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "SDk8IfkQgWql"
   },
   "outputs": [],
   "source": [
    "# Save the trained model\n",
    "\n",
    "save_model_path = (\n",
    "        notebook_config['your_project_path'] + \n",
    "        '/models/' + \n",
    "        notebook_config['model_file_name']\n",
    "    )\n",
    "\n",
    "if notebook_config['save_model']:\n",
    "    \n",
    "    \n",
    "    with open(save_model_path, 'wb') as out_file:\n",
    "        pickle.dump(grad_boost, out_file)\n",
    "    print(f\"Grad Boost model saved in:\\n'{save_model_path}'\")\n",
    "\n",
    "else:\n",
    "\n",
    "    print('According to the notebook_config, the model is NOT saved')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "30Az4tz5kVsX"
   },
   "outputs": [],
   "source": [
    "# SANITY CHECK: Load the model\n",
    "\n",
    "if notebook_config['run_sanity_check']:\n",
    "    with open(save_model_path, 'rb') as in_file:\n",
    "        loaded_model = pickle.load(in_file)\n",
    "\n",
    "    print(f\"Grad Boost model loaded from:\\n'{save_model_path}'\")\n",
    "    print('train score:', loaded_model.score(df_train, y_train))\n",
    "\n",
    "else:\n",
    "    \n",
    "    print('According to the notebook_config, do NOT run Sanity Check')"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 157.935357,
   "end_time": "2023-09-06T15:24:26.682445",
   "environment_variables": {},
   "exception": null,
   "input_path": "__notebook__.ipynb",
   "output_path": "__notebook__.ipynb",
   "parameters": {},
   "start_time": "2023-09-06T15:21:48.747088",
   "version": "2.4.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
